Processing method and apparatus for implementing systolic arrays

ABSTRACT

The present invention relates to a processing method and apparatus for implementing a systolic-array-like structure. Input data are stored in a depth-configurable register means (DCF) in a predetermined sequence, and are supplied to a processing means (FU) for processing said input data based on control signals generated from instruction data, wherein the depth of the register means (DCF) is controlled in accordance with the instruction data. Thereby, systolic arrays can be mapped onto a programmable processor, e.g. a VLIW processor, without the need for explicitly issuing operations to implement the register moves that constitute the delay lines of the array.

The present invention relates to a processing method and apparatus, especially a scaleable VLIW (Very Long Instruction Word) processor or a coarse-grained reconfigurable processor for implementing systolic-array-like structures.

Programmable or configurable processors are pre-fabricated devices that can be customised after fabrication to perform a specific function based on instructions or configurations, respectively, issued to them. These instructions or configurations, when executed in the processor, control the processor resources (e.g. arithmetic logic unit (ALU), register file, interconnection, memory, etc.) to perform certain operations in time (i.e. sequentially) or space (i.e. in parallel). Typically, configurable processors will perform more operations in space than programmable processors, while programmable processors will perform more operations in time than configurable processors.

An algorithm-to-silicon design methodology for digital signal processors (DSP) has been developed, which allows for an enormous increase in design productivity of a DSP designer and a more optimised design of the resulting chip. The methodology initially involves capturing an algorithm in an implementation-independent way. Then, with the help of a set of evaluators and analysers, the algorithm can be tuned and optimised for a fixed-point implementation. Once a satisfactory behaviour is reached, a set of interactive synthesis engines can be applied to map the fixed-point specification to a target VLIW-like architecture. This mapping process is very flexible and fast, which makes it possible to try out many alternatives in a very short time. In general, a very large instance of such a VLIW-like processor architecture can be seen as a coarse-grained reconfigurable processor, in which each control word in its micro-code memory is a configuration. This interpretation is possible due to the size of the corresponding VLIW instruction, which allows for many parallel operations to be performed, therefore largely computing in space.

VLIW processors are used to exploit the available Instruction Level Parallelism (ILP) in an application. To exploit the ILP, data-independent operations are scheduled concurrently in a VLIW instruction.

FIG. 1 shows a schematic diagram indicating a processing application and a corresponding programmable processor structure, where a data-flow graph representing a loop body is shown on the left side. In FIG. 1, circles 20 represent operations, and arrows represent data dependencies between operations. Dashed arrows represent input or output values respectively consumed or produced in a loop iteration. On the right-hand side, a 4-issue slot VLIW processor 10 is depicted, comprising four ALUs A1 to A4 and four issue slots I1 to I4 for controlling the operation of the ALUs A1 to A4. In the present case, the VLIW processor 10 can compute one iteration of the indicated loop processing application in five cycles, executing a sequence of two, four, two, one, and one operation(s) in each cycle, respectively. The number of operations per cycle depends on the number of operations which can be processed concurrently or in parallel, i.e. shown within one horizontal line of the processing application. The partial area 30 of the processing application illustrates the situation in the second cycle, in which four operations are executed in parallel, in one cycle of the VLIW processor 10.
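As a rough check on the schedule described for FIG. 1, the following sketch tallies how many of the available issue-slot cycles actually carry an operation. Only the per-cycle operation counts and the four issue slots are taken from the text; the function name slot_utilization is illustrative and not part of the described apparatus.

```python
# Operations issued in each of the five cycles, as described for FIG. 1.
ops_per_cycle = [2, 4, 2, 1, 1]
ISSUE_SLOTS = 4  # the 4-issue-slot VLIW processor 10


def slot_utilization(ops_per_cycle, issue_slots):
    """Return the fraction of issue-slot cycles that carry an operation."""
    if any(n > issue_slots for n in ops_per_cycle):
        raise ValueError("a cycle issues more operations than there are slots")
    return sum(ops_per_cycle) / (len(ops_per_cycle) * issue_slots)


total_ops = sum(ops_per_cycle)
utilization = slot_utilization(ops_per_cycle, ISSUE_SLOTS)
print(f"{total_ops} operations, utilization {utilization:.0%}")
```

Under these figures, ten operations occupy twenty issue-slot cycles, so half of the slot capacity is idle during a single loop iteration.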

Note that ILP is exploited within the loop body for a single iteration of the loop. Techniques of software pipelining can be used to exploit ILP across loop iterations, but those are typically difficult to implement and are mostly effective only for very simple and small loops, e.g. single basic blocks.

Custom hardware, however, can overlap the execution of every iteration of a loop, keeping most computing resources busy at all cycles. This kind of implementation exploits data locality and pipelining to the extreme. It is known as a systolic array. FIG. 2 shows a schematic diagram indicating a systolic array implementation of the final two taps of a digital filter application, e.g. an FIR (Finite Impulse Response) filter which can generate an output sample at every cycle. The grey blocks are clocked registers R. All function units FU are also busy at every cycle. Input data i is processed locally as it goes down the “pipe” to the right to generate output data o, as in a “pulsating” assembly line. The line acc contains partial accumulations. Registers c contain coefficients to the multipliers. Therefore, this architecture is called a “systolic” array. Systolic arrays allow for very high exploitation of parallelism, obtaining high throughput.
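The pulsating data flow can be illustrated with a small cycle-by-cycle model. The sketch below is not a transcription of FIG. 2; it assumes a textbook systolic FIR arrangement in which the input passes through two registers between neighbouring taps and each partial accumulation passes through one register per tap. The function name systolic_fir and the register counts are assumptions made for illustration.

```python
def systolic_fir(samples, coeffs):
    """Simulate a systolic FIR filter one clock cycle at a time.

    Assumed structure (illustrative only): two input registers between
    neighbouring taps, one accumulation register after each tap.
    Returns the value at the final accumulation register in every cycle.
    """
    taps = len(coeffs)
    x_regs = [0] * (2 * (taps - 1))  # input delay line
    acc_regs = [0] * taps            # partial accumulations ("acc" line)
    outputs = []
    for x_in in samples:
        # x_line[j] is the input sample that entered j cycles ago.
        x_line = [x_in] + x_regs
        # Each tap adds its product to the registered sum of the previous tap.
        acc_in = [0] + acc_regs[:-1]
        new_acc = [acc_in[k] + coeffs[k] * x_line[2 * k] for k in range(taps)]
        # The output is read from the last register before the clock edge.
        outputs.append(acc_regs[-1])
        # Clock edge: every register takes its new value simultaneously.
        x_regs = x_line[:-1]
        acc_regs = new_acc
    return outputs


print(systolic_fir([1, 2, 3, 4, 0, 0, 0, 0, 0], [1, 10]))
# → [0, 0, 1, 12, 23, 34, 40, 0, 0]
```

In this model every multiplier and adder does useful work in every cycle, and after an initial latency equal to the number of taps one output sample appears per cycle. The registers are never addressed by an instruction; they shift purely as a consequence of the clock.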

In Zapata et al, “A VLSI constant geometry architecture for the fast Hartley and Fourier transforms”, IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 1, pp. 58-70, January 1992, an organization of a processor memory is based on first-in-first-out (FIFO) queues to facilitate a systolic data flow and to permit implementation in a direct way of complex data movements and address sequences of the transforms. This is accomplished by means of simple multiplexing operations, using hardware control.

Hence, it is, in principle, possible to map systolic arrays onto a VLIW processor. Then, each function unit FU in the systolic array will correspond to an equivalent unit (e.g. ALU, multiplier, MAC, etc.) in the VLIW processor and will be allocated one issue slot. For the systolic array of FIG. 2, four issue slots would be required in the VLIW processor for the four function units FU. In addition, one register move unit would be required in the VLIW processor for each register move corresponding to a delay line in the systolic array, with its corresponding issue slot. In the systolic array of FIG. 2, seven register moves that correspond to delay lines are provided. Therefore, seven register move units would be required in the VLIW processor, with their additional seven issue slots. This way, there would be more issue slots and, therefore, control signals and associated circuitry, corresponding to register moves than to actual operations. Also, the need for the move units to access the same registers that need to be accessed by other function units, introduces architectural complications in the VLIW design. All this renders the VLIW implementation of systolic arrays impractical. In this respect, it is noted that, in the original systolic array, register moves are encoded in space, by means of FIFO lines of registers that can implement the delay lines without any explicit control.
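The overhead of this direct mapping can be tallied in a few lines. The sketch below uses only the counts stated for FIG. 2 (four function units, seven register moves); the function name naive_vliw_issue_slots is hypothetical.

```python
def naive_vliw_issue_slots(function_units, delay_register_moves):
    """Issue slots needed when every register move gets its own move unit."""
    return function_units + delay_register_moves


FUNCTION_UNITS = 4  # function units FU in the systolic array of FIG. 2
REGISTER_MOVES = 7  # register moves forming the delay lines of FIG. 2

total_slots = naive_vliw_issue_slots(FUNCTION_UNITS, REGISTER_MOVES)
move_share = REGISTER_MOVES / total_slots
print(f"{total_slots} issue slots, {move_share:.0%} of them for register moves")
```

With these counts, eleven issue slots are needed and roughly two thirds of them control register moves rather than arithmetic.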

It is an object of the present invention to enable implementation of systolic array structures by a programmable processor.

This object is achieved by a processing apparatus as claimed in claim 1 and by a processing method as claimed in claim 8.

Accordingly, a programmable processor template for implementing systolic arrays can be achieved by providing a depth-configurable register means at the input of the processing units. Due to the possible implementation of systolic array structures by programmable processors, e.g. VLIW processors, hardware-like performance, mainly throughput, can be provided for media intensive applications, like video streaming, while preserving the flexibility and programmability of a well-known processor paradigm. It could even be possible to get a compiler to automatically generate “systolic array-like” instruction schedules, without need for explicit hardware design. Compilation technology could be extended in this direction.

Thus, a cost-effective VLIW template can be provided for the mapping of systolic structures. This template considerably reduces the overhead created by the current need to explicitly control all register move operations corresponding to delay lines.

Preferably, the register means may comprise distributed register files provided at each input terminal of a plurality of functional units of the processing means. In particular, the distributed register files may comprise depth-configurable FIFO register files addressable for individual registers. The number of physical registers available is fixed by the hardware. Then, the register control means may be arranged to determine the last logical register of the FIFO register files based on control signals derived from the instruction data.

Furthermore, at least one issue slot may be provided for storing the instruction data. The register control means may be arranged to use a part of the bit pattern of the instruction data stored in the at least one issue slot for controlling the depth of the register means.

Other advantageous further developments are defined in the dependent claims.

In the following, the present invention will be described on the basis of a preferred embodiment with reference to the accompanying drawings in which:

FIG. 1 shows a schematic diagram of a processing application and a corresponding programmable processor structure;

FIG. 2 shows a schematic diagram of a systolic array architecture;

FIG. 3 shows a principle architecture for implementing the systolic array architecture of FIG. 2 in a programmable processor according to the present invention; and

FIG. 4 shows a programmable processor architecture according to the preferred embodiment for implementing systolic arrays.

The preferred embodiment will now be described on the basis of a VLIW processor architecture.

In FIG. 3, the systolic array of FIG. 2 is restructured to enable its implementation in a VLIW architecture. Issue slots I1 to I4 are made explicit, and first-in-first-out (FIFO) delay lines comprising registers R are preserved at the input terminals of functional units FU, e.g. ALUs. Dotted boxes represent physical registers available in the hardware but not used in the shown systolic configuration. Drawn this way, the scheme suggests a VLIW template that can efficiently map systolic structures. The intuitive concept illustrated in FIG. 3 can be generalised by providing distributed register files at each input of the functional units FU.

FIG. 4 illustrates a programmable processor architecture according to the preferred embodiment as a VLIW template which can efficiently map systolic structures. In particular, distributed register files DCF are provided, one for each input of each function unit FU. Additionally, an interconnect network consisting of several point-to-point lines is provided and connected to the respective inputs of the functional units by input or output multiplexers 50. Thereby, the point-to-point lines can be written to by a single predetermined function unit FU. Although FIG. 4 suggests full connectivity, the interconnection bus does not need to be fully connected. Furthermore, each input of a functional unit FU can be connected to a standard register file RF, addressable for individual registers. Note that in FIG. 4, for simplicity, only the right one of the inputs of each functional unit FU is shown connected to a respective standard register file RF. Register files with multiple read and/or write ports are also possible.

Due to the fact that the template does not include any centralised structure, i.e. all resources are distributed, it is scaleable, allowing for a very high number of issue slots potentially needed by large systolic arrays, e.g. a 16-tap FIR filter or a large matrix multiplier.

According to the preferred embodiment, a depth-configurable register file DCF is arranged at each input of each function unit FU. The depth-configurable register files DCF may be implemented by FIFO memories whose last logical register can be determined by control signals. However, any other addressable or controllable memory or register structure capable of determining a last logical storage position in a delay line based on control or address signals can be used for implementing the depth-configurable register files DCF.

For a depth-configurable FIFO of N physical registers, the output of the FIFO can be programmed to be at register N, N-1, N-2, ..., 1. By controlling the depth of the FIFO, we can control the number of delay lines it emulates. In FIG. 3, for instance, if the leftmost FIFO had 4 physical registers R, the leftmost depth-controlled register file DCF of FIG. 4 would be controlled by the control signal at the leftmost issue slot I1 so as to place its output terminal at the second register (N-2, N=4), while the lower two registers (N, N-1) remain unused. Thus, the control signals controlling the depth of the depth-controlled register files DCF are part of the bit patterns in the corresponding issue slots I1 to I4.
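The behaviour of such a depth-configurable FIFO can be modelled in software. The following is a minimal sketch, assuming that all physical registers shift on every clock and that only the position of the output tap changes with the configured depth; the class and method names are illustrative and do not come from the described apparatus.

```python
class DepthConfigurableFifo:
    """Software model of a FIFO whose output tap position is programmable."""

    def __init__(self, physical_depth):
        if physical_depth < 1:
            raise ValueError("at least one physical register is required")
        self.regs = [0] * physical_depth  # N physical registers
        self.depth = physical_depth       # logical depth, 1..N

    def set_depth(self, depth):
        """Select the last logical register (the output tap)."""
        if not 1 <= depth <= len(self.regs):
            raise ValueError("depth must lie between 1 and N")
        self.depth = depth

    def output(self):
        """Value currently visible at the output tap."""
        return self.regs[self.depth - 1]

    def clock(self, value):
        """Shift in a value; return what the tap showed before the edge."""
        out = self.output()
        self.regs = [value] + self.regs[:-1]
        return out


fifo = DepthConfigurableFifo(physical_depth=4)
fifo.set_depth(2)  # output at the second register; two registers unused
print([fifo.clock(v) for v in [10, 20, 30, 40, 50]])
# → [0, 0, 10, 20, 30]
```

With the depth set to 2, each value reappears at the output two cycles after it entered, so the FIFO acts as a two-register delay line without any explicit move operation being issued.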

In summary, a programmable processor template for implementing systolic arrays can be achieved by providing a depth-configurable memory or register file DCF at the input terminals of each function unit FU. The depth of the depth-configurable register file DCF is controlled e.g. by respective bits loaded in the corresponding issue slot. With this augmentation, systolic arrays can now be mapped onto a programmable processor, e.g. a VLIW processor, without the need for explicitly issuing operations to implement the register moves that constitute the delay lines of the array. The proposed template can be configured to implement a variety of systolic arrays. It provides for a coarse-grained reconfigurable fabric that allows for hardware-like data throughput, at the same time that it preserves the programmability of a processor.
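How depth bits might share an issue slot with the operation code can be sketched as follows. The text does not specify any bit layout, so everything here is hypothetical: a 4-bit opcode, two inputs per function unit, and a 2-bit depth field per input that stores the depth minus one for FIFOs of up to four physical registers.

```python
from typing import NamedTuple

DEPTH_BITS = 2                       # assumed: encodes depths 1..4 as 0..3
DEPTH_MASK = (1 << DEPTH_BITS) - 1
OPCODE_BITS = 4                      # assumed opcode width
OPCODE_MASK = (1 << OPCODE_BITS) - 1


class IssueSlot(NamedTuple):
    opcode: int       # operation for the function unit
    left_depth: int   # logical depth of the FIFO at the left input
    right_depth: int  # logical depth of the FIFO at the right input


def encode_slot(slot):
    """Pack opcode and both depth fields into one issue-slot word."""
    for depth in (slot.left_depth, slot.right_depth):
        if not 1 <= depth <= DEPTH_MASK + 1:
            raise ValueError("depth out of range for the assumed field width")
    if not 0 <= slot.opcode <= OPCODE_MASK:
        raise ValueError("opcode out of range for the assumed field width")
    return (
        (slot.opcode << (2 * DEPTH_BITS))
        | ((slot.left_depth - 1) << DEPTH_BITS)
        | (slot.right_depth - 1)
    )


def decode_slot(word):
    """Recover the opcode and the two FIFO depths from an issue-slot word."""
    right = (word & DEPTH_MASK) + 1
    left = ((word >> DEPTH_BITS) & DEPTH_MASK) + 1
    opcode = (word >> (2 * DEPTH_BITS)) & OPCODE_MASK
    return IssueSlot(opcode, left, right)


word = encode_slot(IssueSlot(opcode=0x3, left_depth=2, right_depth=4))
print(hex(word), decode_slot(word))
```

In such a scheme the depth of each delay line is set by the same instruction word that selects the operation, so no additional issue slot or move unit is involved.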

It is to be noted that the present invention is not restricted to the preferred embodiment but can be used in any programmable or reconfigurable data processing architecture so as to implement systolic or other pipeline architectures.

1. A processing apparatus for implementing a systolic-array-like structure, said apparatus comprising: a) input means for inputting data; b) register means for storing said input data in a predetermined sequence; c) processing means for processing data received from said register means based on control signals generated from instruction data; and d) register control means for controlling the depth of said register means in accordance with said instruction data.
2. An apparatus according to claim 1, wherein said register means comprises distributed register files provided at input terminals of a plurality of functional units of said processing means.
3. An apparatus according to claim 2, wherein said distributed register files comprise depth-configurable FIFO register files addressable for individual registers.
4. An apparatus according to claim 3, wherein said register control means are arranged to determine the last logical register of said FIFO register files based on control signals derived from said instruction data.
5. An apparatus according to claim 1, further comprising at least one issue slot for storing said instruction data.
6. An apparatus according to claim 5, wherein said register control means are arranged to use a part of the bit pattern of said instruction data stored in said at least one issue slot for controlling said depth of said register means.
7. An apparatus according to claim 1, wherein said programmable processing apparatus is a scalable VLIW processor or a coarse-grained reconfigurable processor.
8. An apparatus according to claim 1, wherein said distributed register files are connected to an interconnect network made up of a plurality of point-to-point connection lines.
9. An apparatus according to claim 8, wherein said point-to-point interconnect lines have a single source.
10. An apparatus according to claim 8, wherein said interconnect network is partially connected.
11. A processing method for implementing a systolic-array-like structure, said method comprising: a) storing said input data in a register file in predetermined sequence; b) processing data received from said register file based on control signals generated from instruction data; and c) controlling the depth of said register file in accordance with said instruction data.